Automated transcription and topic segmentation of large spoken archives

نویسندگان

  • Martin Franz
  • Bhuvana Ramabhadran
  • Todd Ward
  • Michael Picheny
چکیده

Digital archives have emerged as the pre-eminent method for capturing the human experience. Before such archives can be used efficiently, their contents must be described. The scale of such archives along with the associated content mark up cost make it impractical to provide access via purely manual means, but automatic technologies for search in spoken materials still have relatively limited capabilities. The NSF-funded MALACH project will use the world’s largest digital archive of video oral histories, collected by the Survivors of the Shoah Visual History Foundation (VHF) to make a quantum leap in the ability to access such archives by advancing the state-of-the-art in Automated Speech Recognition (ASR), Natural Language Processing (NLP) and related technologies [1, 2]. This corpus consists of over 115,000 hours of unconstrained, natural speech from 52,000 speakers in 32 different languages, filled with disfluencies, heavy accents, age-related coarticulations, and un-cued speaker and language switching. This paper discusses some of the ASR and NLP tools and technologies that we have been building for the English speech in the MALACH corpus. We also discuss this new test bed while emphasizing the unique characteristics of this corpus.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Impact of audio segmentation and segment clustering on automated transcription accuracy of large spoken archives

This paper addresses the influence of audio segmentation and segment clustering on automatic transcription accuracy for large spoken archives. The work forms part of the ongoing MALACH project, which is developing advanced techniques for supporting access to the world’s largest digital archive of video oral histories collected in many languages from over 52000 survivors and witnesses of the Hol...

متن کامل

A new quality measure for topic segmentation of text and speech

The recent proliferation of large multimedia collections has gathered immense attention from the speech research community, because speech recognition enables the transcription and indexing of such collections. Topicality information can be used to improve transcription quality and enable content navigation. In this paper, we give a novel quality measure for topic segmentation algorithms that i...

متن کامل

Off-Topic Spoken Response Detection with Word Embeddings

In this study, we developed an automated off-topic response detection system as a supplementary module for an automated proficiency scoring system for non-native English speakers’ spontaneous speech. Given a spoken response, the system first generates an automated transcription using an ASR system trained on non-native speech, and then generates a set of features to assess similarity to the que...

متن کامل

Towards automatic transcription of large spoken archives - English ASR for the MALACH project

Digital archives have emerged as the pre-eminent method for capturing the human experience. Before such archives can be used efficiently, their contents must be described. The NSF-funded MALACH project aims to provide improved access to large spoken archives by advancing the state-of-the-art in automated speech recognition (ASR), Information Retrieval (IR) and related technologies [1, 2] for mu...

متن کامل

Automatic Transcription of Czech Language Oral History in the MALACH Project: Resources and Initial Experiments

In this paper we describe the initial stages of the ASR component of the MALACH (Multilingual Access to Large Spoken Archives) project. This project will attempt to provide improved access to the large multilingual spoken archives collected by the Survivors of the Shoah Visual History Foundation (VHF) by advancing the state of the art in automated speech recognition. In order to train the ASR s...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2003